public domain
- Europe > Germany (0.28)
- North America > Canada > British Columbia (0.04)
- North America > United States > Virginia (0.04)
- (18 more...)
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.67)
- Law > Statutes (1.00)
- Law > Litigation (1.00)
- Law > International Law (1.00)
- (10 more...)
Checklist 1. For all authors (a)
Another limitation is that the linear model seems to outperform the rank-one quadratic model; we do not fully understand this effect, as discussed in the last paragraph of section 4. A third limitation is that models need to be averaged across time to obtain a single, deployable model: see Figure 5. A final limitation is that we do not yet have convergence theorems or regret bounds for the passive-aggressive updates in these models; see the second paragraph of section 5. (c) Did you discuss any potential negative societal impacts of your work?
Health Insurance Coverage Rule Interpretation Corpus: Law, Policy, and Medical Guidance for Health Insurance Coverage Understanding
U.S. health insurance is complex, and inadequate understanding and limited access to justice have dire implications for the most vulnerable. Advances in natural language processing present an opportunity to support efficient, case-specific understanding, and to improve access to justice and healthcare. Yet existing corpora lack the context necessary for assessing even simple cases. We collect and release a corpus of reputable legal and medical text related to U.S. health insurance. We also introduce an outcome prediction task for health insurance appeals, designed to support regulatory and patient self-help applications, and we release a labeled benchmark for this task along with models trained on it.
- North America > United States > California (0.05)
- North America > United States > Ohio (0.04)
- North America > United States > New York (0.04)
- (4 more...)
- Law (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- (5 more...)
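The appeal-outcome prediction task described in the abstract above can be framed as binary classification over the text of an appeal. A toy sketch of such a baseline follows; the keyword cues and the label names ("overturned" vs. "upheld") are illustrative assumptions, not the released benchmark's schema.

```python
# Toy keyword baseline for appeal-outcome prediction, framed as binary
# classification. The cue phrases and labels below are assumptions for
# illustration only, not the actual benchmark labels.

def predict_outcome(appeal_text: str) -> str:
    """Return 'overturned' if any favorable cue phrase appears, else 'upheld'."""
    favorable = {"medically necessary", "covered benefit", "in-network"}
    text = appeal_text.lower()
    hits = sum(phrase in text for phrase in favorable)
    return "overturned" if hits >= 1 else "upheld"

print(predict_outcome("The treatment was medically necessary under the plan."))  # → overturned
print(predict_outcome("The service is excluded by the policy."))  # → upheld
```

A real system would replace this rule with a model trained on the labeled benchmark, but the input/output shape of the task is the same.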
Utah's High-Stakes PR Campaign to Wrest Control of Public Lands
Utah Attorney General Sean Reyes speaks at the Utah State Capitol in Salt Lake City last year, after state leaders announced they were suing the federal government over 18.5 million acres of Bureau of Land Management land, which covers about 34% of Utah. Saige Miller / KUER via High Country News

This story was originally published by High Country News and Public Domain and is reproduced here as part of the Climate Desk collaboration.

Last year, as Utah prepared to file a federal lawsuit aiming to take control of millions of acres of federal public land within its borders, state officials sought help swaying public opinion in their favor. So they turned to a group of public relations professionals at Penna Powers, a media and branding firm based in Salt Lake City. Backed by a commitment of more than $2 million in taxpayer funds, the firm sprang into action. One of the early orders of business was studying the opposition. In June 2024, an assistant attorney general sent an email to numerous state government colleagues and Penna Powers staffers that contained a video from the Theodore Roosevelt Conservation Partnership (TRCP) in which the well-known hunter and media personality Randy Newberg described the dangers of transferring federal land to state control.
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.45)
- North America > United States > District of Columbia > Washington (0.04)
- North America > United States > Colorado (0.04)
- Law (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Langlais, Pierre-Carl, Hinostroza, Carlos Rosas, Nee, Mattia, Arnett, Catherine, Chizhov, Pavel, Jones, Eliot Krzystof, Girard, Irène, Mach, David, Stasenko, Anastasia, Yamshchikov, Ivan P.
Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These data most often comprise trillions of tokens, with large portions of copyrighted or proprietary content, which hinders the usage of such models under AI legislation. This raises the need for truly open pre-training data that is compliant with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for language model pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissible licenses and amount to about two trillion tokens. The dataset contains a wide variety of languages, ranging from the main European languages to low-resource ones rarely present in pre-training datasets, and it includes a large portion of code data. The diversity of data sources in terms of covered domains and time periods opens up paths for both research and entrepreneurial applications in diverse areas of knowledge. In this technical report, we present the detailed provenance of data assembling and the details of dataset filtering and curation. Already used by industry leaders such as Anthropic and by multiple LLM training projects, Common Corpus will, we believe, become critical infrastructure for open science research in LLMs.
- Oceania > New Zealand (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- North America > Dominican Republic (0.04)
- (13 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government > North America Government > United States Government (0.68)
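The curation step the Common Corpus abstract describes — keeping only uncopyrighted or permissively licensed documents — can be sketched as a simple allowlist filter. The license strings and the document fields below are illustrative assumptions, not the actual Common Corpus schema.

```python
# Minimal sketch of license-based filtering for an open pre-training corpus.
# License identifiers and document fields are assumptions for illustration.

PERMISSIVE_LICENSES = {
    "public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0", "mit", "apache-2.0",
}

def is_openly_licensed(doc: dict) -> bool:
    """Return True if the document's license is in the permissive allowlist."""
    return doc.get("license", "").lower() in PERMISSIVE_LICENSES

def filter_corpus(docs: list) -> list:
    """Keep only documents that pass the license check."""
    return [d for d in docs if is_openly_licensed(d)]

docs = [
    {"id": "a", "license": "CC-BY-4.0"},
    {"id": "b", "license": "all-rights-reserved"},
    {"id": "c", "license": "Public-Domain"},
]
print([d["id"] for d in filter_corpus(docs)])  # → ['a', 'c']
```

A production pipeline would also need to verify that the recorded license is accurate, which is where most of the real curation effort lies.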
The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models
Bommarito, Michael J II, Bommarito, Jillian, Katz, Daniel Martin
Practically all large language models have been pre-trained on data that is subject to global uncertainty related to copyright infringement and breach of contract. This creates potential risk for users and developers due to this uncertain legal status. The KL3M Data Project directly confronts this critical issue by introducing the largest comprehensive training data pipeline that minimizes risks related to copyright or breach of contract. The foundation of this project is a corpus of over 132 million documents and trillions of tokens spanning 16 different sources that have been verified to meet the strict copyright and licensing protocol detailed herein. We are releasing the entire pipeline, including 1) the source code to acquire and process these documents, 2) the original document formats with associated provenance and metadata, 3) extracted content in a standardized format, 4) pre-tokenized representations of the documents, and 5) various mid- and post-train resources such as question-answer, summarization, conversion, drafting, classification, prediction, and conversational data. All of these resources are freely available to the public on S3, Hugging Face, and GitHub under CC-BY terms. We are committed to continuing this project in furtherance of a more ethical, legal, and sustainable approach to the development and use of AI models.
- Europe > United Kingdom (0.14)
- North America > United States > Virginia (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (3 more...)
- Research Report (0.82)
- Overview (0.67)
- Law > Statutes (1.00)
- Law > Intellectual Property & Technology Law (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
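The KL3M abstract above emphasizes releasing original documents "with associated provenance and metadata." One way to picture that is a per-document provenance record carrying the verified source, license, and a content hash for integrity and deduplication. The field names here are assumptions for illustration, not the actual KL3M schema.

```python
# Hypothetical provenance record a copyright-clean pipeline might attach to
# each document; field names are illustrative, not the actual KL3M schema.
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass
class ProvenanceRecord:
    doc_id: str
    source: str   # the verified source the document was acquired from
    license: str  # the verified license, e.g. "public-domain"
    sha256: str   # content hash for integrity checks and deduplication

def make_record(doc_id: str, source: str, license: str, text: str) -> ProvenanceRecord:
    """Build a provenance record, hashing the document text with SHA-256."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return ProvenanceRecord(doc_id, source, license, digest)

rec = make_record("govinfo-123", "govinfo.gov", "public-domain", "Sample statute text.")
print(json.dumps(asdict(rec), indent=2))
```

Storing the hash alongside source and license lets downstream users re-verify that the extracted and tokenized artifacts correspond to the original document.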
China's DeepSeek impresses. But is a 'fast follow' good enough in AI?
American stock markets shuddered on Monday, prompted by China's announcement that it has created a capable, cheap artificial intelligence machine. It's the biggest cloud yet to darken the West's blue-sky enthusiasm over AI, calling into question the efficacy of America's export controls and the billions of dollars the United States is pouring into the technology's expensive cutting edge. Chinese startup DeepSeek says its AI assistant uses less advanced chips than its rivals' models do, and it costs less to train. Unlike the West's billions, the Chinese model was developed for just $5.6 million, by one estimate. "Are we going to spend $500 billion to get to the frontier so that China can find a way to copy our homework for pennies on the dollar?"
- Asia > China > Beijing > Beijing (0.06)
- North America > United States > New York > New York County > New York City (0.05)
- Banking & Finance > Trading (1.00)
- Government > Regional Government > North America Government > United States Government (0.31)
Towards Best Practices for Open Datasets for LLM Training
Baack, Stefan, Biderman, Stella, Odrozek, Kasia, Skowron, Aviya, Bdeir, Ayah, Bommarito, Jillian, Ding, Jennifer, Gahntz, Maximilian, Keller, Paul, Langlais, Pierre-Carl, Lindahl, Greg, Majstorovic, Sebastian, Marda, Nik, Penedo, Guilherme, Van Segbroeck, Maarten, Wang, Jennifer, von Werra, Leandro, Baker, Mitchell, Belião, Julie, Chmielinski, Kasia, Fadaee, Marzieh, Gutermuth, Lisa, Kydlíček, Hynek, Leppert, Greg, Lewis-Jong, EM, Larsen, Solana, Longpre, Shayne, Lungati, Angela Oduor, Miller, Cullen, Miller, Victor, Ryabinin, Max, Siminyu, Kathleen, Strait, Andrew, Surman, Mark, Tumadóttir, Anna, Weber, Maurice, Weiss, Rebecca, White, Lee, Wolf, Thomas
Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in jurisdictions like the EU and Japan, this is allowed under certain restrictions, while in the United States the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend towards minimizing the information shared about training datasets by both corporate and public interest actors. This trend of limiting data information causes harm by hindering transparency, accountability, and innovation in the broader ecosystem, denying researchers, auditors, and impacted individuals the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing there are no such models (trained at a meaningful scale), due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.
- Asia > Japan (0.24)
- North America > United States > New York (0.04)
- Europe > France (0.04)
- Law > Intellectual Property & Technology Law (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Law > Litigation (0.88)
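Among the challenges the best-practices abstract names is "incomplete and unreliable metadata." A simple completeness check conveys the idea: flag records missing the fields a curated open dataset would need. The required fields below are an assumption for illustration, not a published metadata standard.

```python
# Illustrative metadata-completeness check for open dataset curation.
# The required-field list is an assumption, not a published standard.

REQUIRED_FIELDS = ["title", "source_url", "license", "collection_date", "language"]

def missing_metadata(record: dict) -> list:
    """Return the required metadata fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

record = {"title": "Court opinions 2020", "license": "public-domain", "language": "en"}
print(missing_metadata(record))  # → ['source_url', 'collection_date']
```

Checks like this are cheap to run at ingestion time, which is when missing provenance is still recoverable; filling the same gaps years later is the expensive digitization-and-research work the paper describes.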